Lemmatization of Polish Person Names

Authors

  • Jakub Piskorski
  • Marcin Sydow
  • Anna Kupsc
Abstract

The paper presents two techniques for lemmatization of Polish person names. First, we apply a rule-based approach which relies on linguistic information and heuristics. Then, we investigate an alternative knowledge-poor method which employs string distance measures. We provide an evaluation of the adopted techniques using a set of newspaper texts.
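
As a rough illustration of the knowledge-poor idea, the sketch below maps an inflected name form to the closest entry in a list of candidate base forms. The abstract does not say which string distance measures the paper evaluates, so plain Levenshtein distance is used here purely as a stand-in; the candidate list and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def nearest_lemma(inflected: str, candidate_lemmas: list[str]) -> str:
    """Return the candidate base form closest to the inflected form."""
    return min(
        candidate_lemmas,
        key=lambda lemma: levenshtein(inflected.lower(), lemma.lower()),
    )


# Hypothetical candidate base forms, for illustration only.
CANDIDATE_LEMMAS = ["Kowalski", "Nowak", "Piskorski"]

if __name__ == "__main__":
    for form in ["Kowalskiego", "Nowaka", "Piskorskim"]:
        print(form, "->", nearest_lemma(form, CANDIDATE_LEMMAS))
```

A sketch like this needs no morphological dictionary, only a list of known base forms, which is what makes the approach knowledge-poor. It will, however, pick some candidate even when the true base form is missing from the list.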

Similar resources

Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. W...

Lemmatization of Multi-word Common Noun Phrases and Named Entities in Polish

In the paper we present a tool for lemmatization of multi-word common noun phrases and named entities for Polish called PoLem1. The tool is based on a set of manually crafted rules and heuristics utilizing a set of dictionaries (including morphological, named entities and inflection patterns). The accuracy of lemmatization obtained by the tool reached 97.99% on a dataset with multi-word common ...

ENIAM: Categorial Syntactic-Semantic Parser for Polish

This paper presents ENIAM, the first syntactic and semantic parser that generates semantic representations for sentences in Polish. The parser processes non-annotated data and performs tokenization, lemmatization, dependency recognition, word sense annotation, thematic role annotation, partial disambiguation and computes the semantic representation.

Proper Names in Dialogs from the Warsaw Transportation Call Center

In the paper we present the method of automatic recognition and annotation of proper names which occur in dialogs gathered at the Warsaw city transportation information center. We describe different types of proper names and how people use them in dialogs. We present rules of automatic recognition and lemmatization of proper names in the transportation domain.

Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts

In this paper we present several optimizations introduced to Conditional Random Fields-based model for proper names recognition in Polish running texts. The proposed optimizations refer to word-level segmentation problems, gazetteers incompleteness, problem of unambiguous generalization features, feature construction and selection, and finally recognition of common proper names on the basis of ...

Publication date: 2007